Ticketmaster was recently in the news after a major system meltdown when tickets for Taylor Swift’s latest concert tour went on sale. Demand from Taylor Swift’s fan base, a population of such size and economic influence that it probably qualifies for its own seat at the United Nations, crashed the ticket sales platform under its sheer weight.
Thousands of furious fans vented their frustration on social media as they waited for hours in a virtual buying queue, only to be kicked out. Others finally got in, only to watch tickets disappear from their carts before they could check out.
In the understatement of the century, Ticketmaster responded that their platform “experienced some unexpected difficulties” when 14 million simultaneous users (and bots) tried to buy tickets. The number of requests ballooned to a staggering 3.5 billion, four times the peak of any previous ticket sale. The company’s explanation did little to shield it from ongoing damage, though. Members of Congress began discussing (re)opening an antitrust investigation into Ticketmaster’s alleged monopolistic behavior, and vengeful Swifties have now filed a class action lawsuit (never mess with mother-daughter bonding rituals).
The truth is, Ticketmaster isn’t the first large company to suffer such a brutally public system failure, and, so long as businesses rely on technology, it certainly won’t be the last. As Ticketmaster just found out, a failure like this can have serious long-tail negative impact: government investigations, customer lawsuits, your business or product name becoming synonymous with failure. (Drinking a New Coke while listening to tunes on your Zune, anyone?)
Swifties schooled every business in the world on an important lesson: screw up an experience like this, and the fallout is dangerous. So, what lessons can companies of any size take away from the TayTay-Ticketmaster meltdown when it comes to preventing their own version of an epic public facepalm?
Every time a high-visibility meltdown of this magnitude happens (Amazon Black Friday, 2018; Facebook’s IPO, 2012; the Walmart/GameStop/Target Xbox Series X release, 2020), companies claim “There was unprecedented demand on our system!” as justification.
This is simply BS. No matter what product or service your company slings, your entire reason for existing is to meet customer demand. Which means your job is to build operational resilience and inherent scalability into your system so it can handle whatever comes — not just your best guess at what might come.
In Ticketmaster’s case, there was no choreographed attack trying to take the system down. Instead, bots in large volumes attempted to buy Taylor Swift tickets in bulk, and the combined traffic from bots and fans is what brought down Ticketmaster’s servers. Ultimately, Ticketmaster failed to meet demand.
Still got scars on our brand so don’t think it’s in the past / These kind of wounds, they last and they last…
The Ticketmaster/T-Swift meltdown is but the latest object lesson in capacity planning: don’t wait until a surge event to figure out how much traffic is too much traffic. Building scalable systems that can handle spiky workloads requires deliberate consideration and planning, not a vague notion that you’ll throw money at more servers if you unexpectedly hit number one in the App Store (or in any other way suddenly succeed beyond your wildest dreams).
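To make “how much is too much” concrete, even a quick back-of-envelope model helps. The Python sketch below shows the kind of arithmetic involved; every number in it (user counts, per-user request rates, surge multiplier) is an illustrative assumption, not a real Ticketmaster figure.

```python
# Back-of-envelope capacity planning. All numbers below are illustrative
# assumptions, not real Ticketmaster figures.

expected_users = 1_500_000          # concurrent users you *expect* at peak
requests_per_user_per_min = 30      # page loads, seat-map polls, cart calls...
surge_multiplier = 10               # how wrong your estimate might be (bots, virality)
headroom = 1.5                      # extra safety margin on top of the surge case

expected_rps = expected_users * requests_per_user_per_min / 60
planned_rps = expected_rps * surge_multiplier * headroom

print(f"Expected steady peak: {expected_rps:,.0f} req/s")
print(f"Plan (and load test) for: {planned_rps:,.0f} req/s")
```

Whatever numbers you land on, the point is that the planned peak becomes a load-testing target, not a hope.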
Selecting the right application architecture, components, and services requires careful consideration of the infrastructure, caching layers, APIs, database, and more. Putting the right pieces together in the right way will automatically distribute the load and allow you to surf even the most titanic of traffic tsunamis without wiping out.
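One common pattern for absorbing that kind of spike is an admission gate, the “virtual waiting room” idea, placed in front of the hot checkout path so the backend only ever sees traffic it was sized for. The sketch below is a minimal, hypothetical single-process version in Python; the class name, rates, and usage are illustrative assumptions, not how Ticketmaster’s queue actually works.

```python
import time

class AdmissionGate:
    """Minimal token-bucket gate: admit at most `rate` checkouts per second,
    send everyone else to a waiting room instead of overloading the backend."""

    def __init__(self, rate: float, burst: int):
        self.rate = rate          # sustainable checkouts per second
        self.capacity = burst     # short bursts the backend can absorb
        self.tokens = float(burst)
        self.last = time.monotonic()

    def try_admit(self) -> bool:
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True           # proceed to checkout
        return False              # back to the waiting room, not a 500 error

# Hypothetical usage: the backend is sized for ~500 checkouts/s, so the gate
# shields it no matter how many fans (or bots) show up at once.
gate = AdmissionGate(rate=500, burst=1000)
admitted = sum(gate.try_admit() for _ in range(5000))
print(f"{admitted} of 5000 simultaneous requests admitted; the rest wait, the site stays up")
```

In production this would be backed by something distributed (a shared queue or token store), but the principle is the same: admit what you can serve, and park the rest gracefully instead of returning errors.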
I laid the groundwork / and then just like clockwork / new nodes spun up in a line…
Ticketmaster, in recent years, has come to control 80% of the market for primary ticket sales to major concerts. As one of the only games in town, experts say, the company has had little incentive to innovate or address technical debt. By coasting without making improvements to its systems, Ticketmaster had a hand in its own catastrophe. See above: selecting the right application architecture beats resting on “good enough” and hoping for the best.
Legacy systems only want data if it’s torture / Don’t say I didn’t, say I didn’t warn ya…
Common capacity planning practice is to brace for a threshold of 10 billion system calls. Ticketmaster reported that Taylor Swift ticket sale traffic peaked at 3.5 billion requests. That gap could indicate the system had an unrecognized bottleneck that triggered cascading failures when 14 million customers all tried to squeeze through the same door at the same time.
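A classic defense against that kind of cascade is a circuit breaker: when a downstream dependency starts failing, fail fast for a while instead of letting every request pile up behind the bottleneck. The toy Python sketch below illustrates the idea; the thresholds and the inventory_service it wraps are hypothetical, not a description of Ticketmaster’s architecture.

```python
import time

class CircuitBreaker:
    """Toy circuit breaker: after `max_failures` consecutive errors from a
    downstream dependency, fail fast for `cooldown` seconds instead of letting
    every request queue up behind the bottleneck."""

    def __init__(self, max_failures: int = 5, cooldown: float = 30.0):
        self.max_failures = max_failures
        self.cooldown = cooldown
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if time.monotonic() - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open: failing fast, not queueing")
            self.opened_at = None   # cooldown over, probe the dependency again
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.monotonic()
            raise
        self.failures = 0
        return result

# Hypothetical usage: wrap calls to a hypothetical inventory service so a slow
# or dead dependency sheds load instead of dragging checkout down with it.
# breaker = CircuitBreaker()
# seats = breaker.call(inventory_service.reserve_seats, event_id, quantity)
```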
Are you tired of waiting, wondering if your system call is ever coming around…?
Not the T-Swift ticket pre-sale waiting room mayhem variety, but chaos engineering. Surge events increase the likelihood of some backend service degrading or going down for any number of reasons: flooded network capacity, spiked CPU, exceeded storage quotas, and so on. Pushing teams to deal with forced system failures uncovers weaknesses (see the bottleneck discussion above), and practicing worst-case-scenario drills shortens recovery time (RTO) in a real-world SHTF event.
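A chaos drill doesn’t have to start with anything fancier than deliberately injecting failures into a dependency call and watching how the rest of the system copes. The Python sketch below is a minimal, hypothetical version of that idea; seat_lookup, the failure rate, and the drill itself are made up for illustration.

```python
import random

def chaos(fn, failure_rate=0.2, exc=TimeoutError):
    """Wrap a dependency call and randomly inject failures, the way a
    game-day drill would, so you find out *now* whether callers degrade
    gracefully instead of cascading."""
    def wrapped(*args, **kwargs):
        if random.random() < failure_rate:
            raise exc("injected fault: pretend the seat-map service timed out")
        return fn(*args, **kwargs)
    return wrapped

# Hypothetical drill: seat_lookup stands in for a real backend dependency.
def seat_lookup(section: str) -> list[str]:
    return [f"{section}-{row}" for row in ("A", "B", "C")]

flaky_lookup = chaos(seat_lookup, failure_rate=0.3)

failures = 0
for _ in range(1000):
    try:
        flaky_lookup("floor")
    except TimeoutError:
        failures += 1        # in a real drill: check fallbacks, alerts, recovery
print(f"{failures} of 1000 calls failed; did the rest of the system stay healthy?")
```

In a real game day you’d run this against staging (or carefully scoped production) infrastructure, then measure whether fallbacks kicked in, alerts fired, and recovery stayed inside your RTO.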
There’s no time for tears / I’m just sitting here planning my disaster recovery…
If a massive public outage strikes your product/service/platform and you just never quiiiite got around to meaningful capacity planning or disaster recovery strategies, well… maybe build a time machine and travel back to when you could do the good things that would keep The Very Bad Event from happening?
I think I’ve seen this film before, and I didn’t like the ending…
The ultimate takeaway from the Taylor Swift-Ticketmaster meltdown is: don’t let legacy tech become the anti-hero. Investigating technical debt, checking for bottlenecks, implementing forced failure drills, evolving to scalable infrastructure: these are the actions that will keep your systems healthy, your users happy, and you from being the star of a post-event retro where the theme song is:
“Hi, it’s me, I’m the problem, it’s me. At stand-up, all the engineers agree…”
Disasters happen. (Most) outages shouldn’t.